NUCLEIC ACID-LEVEL ANALYSIS OF PROTEIN STRUCTURE



Background of the Invention
The invention relates to methods of evaluating, altering, and designing protein
structures.



Summary of the Invention 

Methods of the invention incorporate considerations of mRNA sequence and
structure and codon-anticodon energetics into the analysis and design of protein
structure. Many prior art methods for analyzing or designing protein structure have
relied in part or in whole on analysis at the amino acid level. Proteins, however, are the
product of a process which involves a number of cellular entities and their interactions.
The interaction of mRNA molecules with the protein translation machinery (e.g.,
ribosomes and tRNAs, as well as other elements of the cellular environment, such as
water and salt molecules) and mRNA intrachain interactions place physico-chemical
restraints on the overall process.

Not only are proteins the product of a process, but the process itself has evolved
over time. Some constraints, e.g., those imposed by the interaction of mRNAs with
environmental elements or with primitive ribosomal structures, or those of mRNA structure
and energetics, e.g., the propensity to form secondary structure, may have been more
important, or at least different, primordially than they are currently. While not wishing
to be bound by theory, the inventors postulate that evidence of those prior constraints
may be seen in the sequence of current messages.

Methods of the invention provide for the analysis and design of protein structures
on the basis of patterns or features of the nucleic acid message, e.g., codon usage
patterns or coding modalities. Methods of the invention are based on dividing the
genetic code, that is, the codon-anticodon pairs which specify amino acids (and stops),
into classes, sometimes referred to herein as subcodes or coding modalities, and
evaluating a nucleic acid sequence which encodes a protein structure based on its class
(i.e., subcode or coding modality). Relevant subcodes or coding modalities can be
defined using choice parameters which are a function of message-level properties,
wherein each property is related to the composition or structure of the nucleic acid, and
is other than the identity of the amino acid (or stop) encoded and other than codon bias.
Examples of structural choice parameters, which can serve as methods or rules for
assignment of codons into classes, include the nature of the substituents on the coding
bases (e.g., so-called keto-rich bases U and G or amino-rich bases A and C); size of the
coding bases (e.g., purine vs. pyrimidine); hydrogen bonding; base-stacking energies
of the coding bases in overlapping base pairs; and the like. Examples of compositional
choice parameters include frequencies of subclasses of codons within more than one of
the three alternative reading frames in which a nucleic acid message can be read.
Alternative subcodes or coding modalities are not necessarily entirely disjoint,
discrete, or unique, and identical subcodes or coding modalities can be obtained using
structural and/or compositional parameters.

Methods of the invention allow the identification, analysis, modification and
design of protein structures on the basis of patterns or features revealed by the nucleic
acid, e.g., the messenger nucleic acid. For example, the identification of a "run" of
amino acid residues of a class can be indicative of an evolutionarily conserved region.
The identification of a "minority" class codon in a run of majority class codons can be
indicative of a structure- or function-critical residue. The discovery of a critical residue
can be used in the design or modification of a protein, e.g., to develop a second
generation protein. For example, in situations where it is desirable to alter structure or
activity of a protein, it may be desirable to alter a critical residue(s) (or a residue which
interacts with a critical residue, e.g., an adjacent residue or a residue elsewhere in the
protein (or in another protein) with which it interacts). In the case where a change which
does not result in significant alterations in structure or activity is desired, residues other
than the identified critical residue (or other than residues which interact with it) are
changed.

Methods of the invention provide for nearest neighbor frequencies calculated
based upon the frequency or pattern of selected classes of codons, i.e., by codon class of
the amino acid, and thus provide a higher degree of relevance for analysis of
single-class-rich protein structures. Conventional tables of nearest neighbor amino acids do not
take into account the classes described herein, and as such, provide only "average"
values across multiple classes of codons. Also, unlike tables of the invention,
conventional nearest-neighbor tables do not take into account the fact that consistent
secondary/tertiary structures of proteins can be shown to correlate with: a) "out of
frame" properties of protein messages; b) "interframe" properties of protein messages,
i.e., correlations between properties of messages read in frame 1, properties of messages
read in frame 2, and/or properties of messages read in frame 3, as defined below.



In general, the invention features a method of evaluating protein structure. The
method includes:

providing a nucleic acid sequence which encodes the protein structure;

assorting bases of the nucleic acid sequence into subject triplets; and

assigning one or a plurality of subject triplets to one of a plurality of classes,
wherein the assignment is a function of classifying triplets of the nucleic acid sequence
as members of a class of a binary choice alphabet of n degrees of freedom, and wherein
the classes can be generated by applying n binary choice parameters to a triplet to yield
at least 2^n classes of subject triplets, wherein a binary choice parameter is a function of a
message-level property of the nucleic acid sequence,

thereby evaluating the protein structure. Triplets can be assigned to a class based
on whether they satisfy a value for the message level property, e.g., a triplet can be
assigned to a class based on whether its value for a parameter, e.g., enthalpy for
formation of a codon-anticodon duplex, is above or below a predefined value, or whether
or not it possesses a particular characteristic, e.g., whether it is GC rich. The
message-level property is other than the identity of the amino acid or punctuation which a triplet
encodes and is other than codon bias.
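
The assignment step above can be illustrated with a minimal sketch for n = 2. The two parameters used here (purine vs. pyrimidine and keto vs. amino) are taken from the examples in the text, but the rule that a triplet is "rich" in whichever label at least two of its three bases carry is an assumption made for illustration, as are the names PARAMETERS, classify and class_sizes.

```python
from itertools import product

# Each binary choice parameter maps a single base to one of two labels.
# The majority rule applied in classify() is an illustrative assumption.
PARAMETERS = [
    lambda b: "R" if b in "AG" else "Y",  # purine vs. pyrimidine
    lambda b: "K" if b in "UG" else "M",  # keto (U, G) vs. amino (A, C)
]

def classify(triplet, parameters=PARAMETERS):
    """Return one label per parameter: the label carried by at least two bases."""
    labels = []
    for param in parameters:
        votes = [param(base) for base in triplet]
        # Three votes over two labels always give a unique majority.
        labels.append(max(set(votes), key=votes.count))
    return tuple(labels)

def class_sizes(parameters=PARAMETERS):
    """Count how many of the 64 triplets fall into each class."""
    sizes = {}
    for bases in product("UCAG", repeat=3):
        cls = classify("".join(bases), parameters)
        sizes[cls] = sizes.get(cls, 0) + 1
    return sizes
```

With two parameters the sketch yields 2^2 = 4 classes. Under this particular majority rule the 64 triplets split evenly among them, so it does not reproduce the 3:5:3:5 distribution discussed elsewhere in the text; that distribution belongs to a different grouping of codons.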

The class constant table provides a measure of the frequency with which a first
and a second amino acid occur as nearest neighbors, wherein nearest neighbor
frequencies are determined within a codon class, and wherein a class is a function of a
message level property of a nucleic acid, e.g., the codon, which encodes an amino acid.
The class can be any class generated by the binary choice parameter-based methods
referred to herein. For example, if the classes are a first class, e.g., high enthalpy codons,
and a second class, e.g., low enthalpy codons, the table is generated for nearest
neighbors where both neighbors are encoded by codons of either the first class or codons
of the second class.



In another aspect, the invention features a method of evaluating a protein
structure. The method includes: providing a class-constant table of nearest neighbor
relationships for amino acid residues;

providing a nucleic acid which encodes a protein structure; and

comparing one or a plurality of the observed nearest neighbor pairs in the protein
structure with the frequencies provided by the class constant table, thereby evaluating
the protein structure.

In preferred embodiments, the comparison can include: assigning an expected
frequency from the class constant table to one or a plurality of the observed nearest
neighbor pairs and determining how many of the observed nearest neighbor pairs fall
above or below a predetermined value; determining the likelihood of occurrence, as
predicted by the class constant table, for an observed nearest neighbor pair; or
determining if an observed nearest neighbor pair of a first and a second amino acid
residue from the protein structure is predicted by the class constant table to occur at a
predetermined frequency.
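
One way to picture a class-constant nearest neighbor table is the following sketch, which counts adjacent amino acid pairs only when both codons fall in the same class. The standard genetic code is used for translation. The classifier gc_class (GC-rich meaning at least two G or C bases) and the function name class_constant_table are hypothetical choices for illustration, not the inventors' definitions.

```python
from collections import Counter
from itertools import product

# Standard genetic code, RNA alphabet, first/second/third base in U, C, A, G order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product("UCAG", repeat=3), AMINO)}

def gc_class(codon):
    """Hypothetical binary choice parameter: 'S' if at least two bases are G/C, else 'W'."""
    return "S" if sum(base in "GC" for base in codon) >= 2 else "W"

def class_constant_table(messages, classify=gc_class):
    """Relative frequency of amino acid neighbor pairs, tabulated per codon class.

    A pair is counted only when both of its codons belong to the same class.
    """
    counts = {}
    for message in messages:
        codons = [message[i:i + 3] for i in range(0, len(message) - 2, 3)]
        for first, second in zip(codons, codons[1:]):
            cls = classify(first)
            if cls != classify(second):
                continue  # mixed-class pair: excluded from a class-constant table
            pair = (CODON_TABLE[first], CODON_TABLE[second])
            counts.setdefault(cls, Counter())[pair] += 1
    return {
        cls: {pair: n / sum(c.values()) for pair, n in c.items()}
        for cls, c in counts.items()
    }
```

An observed pair in a protein structure of interest can then be looked up in the sub-table for its class and compared against a predetermined frequency, as in the preferred embodiments above.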

In another aspect, the invention features a method of evaluating a protein
structure for resistance to change, e.g., evolutionary or mutational change. The method
includes:

identifying regions of a protein which are encoded by runs of a single subcode,
thereby identifying regions which have been resistant to change and which are therefore
predicted to be functionally or structurally significant. E.g., the method can include
determining if the nucleic acid sequence which encodes the protein structure includes a
run of triplets, e.g., a run at least 20, 40, 60, or 120 triplets in length, in which at least 20,
40, 60, 80, 90 or 95%, or all of the triplets in the run are from one class. Any of the
ways of generating classes described herein can be used in this method.
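
A minimal sketch of run detection follows. It handles only the strictest case named above, in which all triplets in the run are from one class; runs that tolerate a lower fraction (e.g., 95%) would need a windowed variant. The function name and the default minimum length are illustrative assumptions.

```python
from itertools import groupby

def single_class_runs(classes, min_length=20):
    """Return (start, length, cls) for each maximal run of one class.

    `classes` is the sequence of class labels assigned to successive triplets.
    Only runs in which every triplet shares the same class are reported.
    """
    runs = []
    position = 0
    for cls, group in groupby(classes):
        length = sum(1 for _ in group)
        if length >= min_length:
            runs.append((position, length, cls))
        position += length
    return runs
```

For example, with labels "AAABBBBBA" and min_length=4, the sketch reports a single run of class B starting at triplet index 3.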

In another aspect, the invention includes a method of evaluating a protein
structure for the presence of critical amino acid residues. The method includes:

identifying critical amino acid residues by identifying "minority codons" in runs
encoded by codons of a single class or subcode, thereby identifying residues which have
been resistant to change and which are therefore believed to be functionally important.
Any of the ways of generating classes described herein can be used in this method.
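
The idea of a minority codon can be sketched as follows: within a region dominated by one class, report the positions whose class differs. The 90% dominance threshold and the function name are assumptions for illustration; the text does not fix a threshold for this method.

```python
from collections import Counter

def minority_positions(classes, min_fraction=0.9):
    """Return (majority_class, positions of minority triplets) for a region.

    If no class reaches `min_fraction` of the region, the region is not
    treated as a single-class run and (None, []) is returned.
    """
    if not classes:
        return None, []
    majority, count = Counter(classes).most_common(1)[0]
    if count / len(classes) < min_fraction:
        return None, []
    return majority, [i for i, cls in enumerate(classes) if cls != majority]
```

The positions returned index triplets, so each one maps directly to a candidate critical residue in the encoded protein structure.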

In another aspect, the invention features a method for evaluating a protein
structure. The method includes:

providing a nucleic acid sequence which encodes the protein structure;

assorting bases of the nucleic acid sequence into subject triplets; and

assigning at least one of the subject triplets to one of a plurality of classes,
wherein the assignment is a function of classifying the subject triplets of the nucleic acid
sequence under a binary choice alphabet of n degrees of freedom by applying n binary
choice parameters to a triplet to yield at least 2^n classes of subject triplets, wherein the
assignment provides at least four classes of triplets, the at least four classes of triplets
being represented in at least a portion of the nucleic acid sequence in a ratio of about
3:5:3:5;

thereby evaluating the protein structure.

In another aspect, the invention features a method for identifying coding regions
of a nucleic acid sequence, the method comprising:

providing the nucleic acid sequence;

assorting bases of at least a portion of the nucleic acid sequence into a plurality
of subject triplets;

assigning the plurality of subject triplets to one of a plurality of classes, wherein
the assignment is a function of classifying the subject triplets of the nucleic acid
sequence under a binary choice alphabet of n degrees of freedom by applying n binary
choice parameters to a triplet to yield at least 2^n classes of subject triplets, wherein the
assignment provides at least four classes of triplets A, B, C, and D; and

determining whether the plurality of subject triplets are distributed into the at
least four classes of triplets A:B:C:D in a ratio of about 3:5:3:5;

thereby identifying coding regions of the nucleic acid sequence.
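
The ratio test in the final step can be sketched as a simple comparison of observed class fractions against 3/16, 5/16, 3/16 and 5/16. The text says only "about 3:5:3:5", so the tolerance below is a hypothetical parameter, and the function name is likewise illustrative.

```python
def matches_ratio(counts, ratio=(3, 5, 3, 5), tolerance=0.05):
    """Check whether class counts A, B, C, D are distributed in about the given ratio.

    `tolerance` is the largest allowed absolute difference between an observed
    class fraction and its expected fraction (an assumed cutoff).
    """
    total = sum(counts)
    if total == 0:
        return False
    expected = [r / sum(ratio) for r in ratio]
    observed = [c / total for c in counts]
    return all(abs(o - e) <= tolerance for o, e in zip(observed, expected))
```

A statistical goodness-of-fit test could replace the fixed tolerance where sequence lengths vary widely; the fixed cutoff is used here only to keep the sketch short.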

In another aspect, the invention features a method for identifying a protein that
includes a polypeptide portion which is structurally or functionally similar to all or a
portion of a test protein, the method comprising:

providing a nucleic acid sequence which encodes all or a portion of the test
protein;

assorting bases of at least a portion of the nucleic acid sequence into a plurality
of subject triplets in a first reading frame;

assigning the plurality of subject triplets in the first reading frame to one of a
plurality of classes, wherein the assignment is a function of classifying the subject
triplets of the nucleic acid sequence under a first binary choice alphabet of n degrees of
freedom by applying n first binary choice parameters to a triplet to yield at least 2^n
classes of subject triplets, wherein the assignment provides at least four classes of
triplets distributed in a ratio of about 3:5:3:5;

assorting bases of the at least a portion of the nucleic acid sequence into a
plurality of subject triplets in a second reading frame;

assigning the plurality of subject triplets in the second reading frame to one of a
plurality of classes, wherein the assignment is a function of classifying the subject
triplets of the nucleic acid sequence under a second binary choice alphabet of n degrees
of freedom by applying n second binary choice parameters to a triplet to yield at least 2^n
classes of subject triplets, wherein the assignment provides at least four classes of
triplets distributed in a ratio of about 3:5:3:5; and

identifying a protein which includes a polypeptide portion encoded by the
plurality of triplets in the second reading frame;

thereby identifying a protein that includes a polypeptide portion which is
structurally or functionally similar to all or a portion of the test protein.

In another aspect, the invention features a method for identifying a
mutation-prone region of a nucleic acid sequence, e.g., a viral nucleic acid sequence. The method
includes:

providing the nucleic acid sequence;

assorting bases of at least a portion of the nucleic acid sequence into a plurality
of subject triplets in a first reading frame;

assigning the plurality of subject triplets in the first reading frame to one of a
plurality of classes, wherein the assignment is a function of classifying the subject
triplets of the nucleic acid sequence under a binary choice alphabet of n degrees of
freedom by applying n binary choice parameters to a triplet to yield at least 2^n classes of
subject triplets, wherein the assignment provides at least four classes of triplets
distributed in a ratio of about 3:5:3:5;

assorting bases of the at least a portion of the nucleic acid sequence into a
plurality of subject triplets in a second reading frame; and

assigning the plurality of subject triplets in the second reading frame to one of a
plurality of classes, wherein the assignment is a function of classifying the subject
triplets of the nucleic acid sequence under a binary choice alphabet of n degrees of
freedom by applying n binary choice parameters to a triplet to yield at least 2^n classes of
subject triplets, wherein the assignment provides at least four classes of triplets
distributed in a ratio of about 3:3:3:5;

thereby identifying a mutation-prone region of the nucleic acid sequence.

In another aspect, the invention includes a method of providing a protein
structure, e.g., the structure of a protein of known function, in which one or a plurality of
amino acid residues are changed. The method includes:

providing a nucleic acid sequence which encodes a candidate protein structure;

evaluating the sequence by a method described herein; and

altering one or a plurality of amino acid residues in the candidate protein
structure,

thereby providing a protein structure.



In yet another aspect, the invention features a machine-readable data storage
medium, including a data storage material encoded with machine readable data which,
when used with a machine programmed with instructions for using the data, is capable of
storing, retrieving, or displaying databases, binary choice alphabets, protein sequences, or
nucleic acid sequences of the invention. The storage medium can be used in methods of
the invention. In preferred embodiments the storage medium is recorded with: a class
constant nearest neighbor table; the classes into which the triplets of a nucleic acid are
assigned; or a nucleic acid sequence which encodes a protein structure, or a protein
structure, which is to be analyzed or which has been altered by application of a method
described herein.

Methods referred to herein can further include creating a record of one or more
protein structures to be analyzed or modified, e.g., proteins, protein portions or
fragments, or nucleic acids which encode all or part of such protein structure. The
protein or nucleic acid structure which is to be analyzed or modified, or the structure
which has been identified, evaluated or modified, or both, can be recorded. The record
can be encoded in the form of a machine-readable data storage medium. The recorded
structure, e.g., a nucleic acid or amino acid sequence, can be displayed on a machine,
e.g., on a monitor, or in printed form.

Methods referred to herein can further include providing an identified or
modified substance, e.g., a protein or nucleic acid, e.g., chemically synthesizing the
identified substance based on the structure identified by way of the methods described
herein. In preferred embodiments, the method includes assessing the biological activity
of the identified substance. The biological activity of the identified substance can be
assessed in vitro or in vivo. In preferred embodiments, the identified substance can be
combined with a carrier (e.g., naturally derived or synthetic polymers, solvents,
dispersion media, coatings, antibacterial and antifungal agents, and the like) suitable for
introduction into any living cell or organism, e.g., an animal model.

Methods referred to herein can further include providing a three dimensional
representation of the protein structure, or a representation of the primary sequence of the
protein structure, either before or after a modification. The structure can be compared to
the candidate structure or can be evaluated for the ability to exhibit a predetermined
structure, e.g., possession of a structural component such as a helix or a turn segment,
or an activity, e.g., the ability to dock with a second protein.



WO 98/18814 






In methods referred to herein the nucleic acid sequence can be any of: a genomic
sequence; an mRNA sequence; a sequence which encodes a protein structure of known
function; a sequence for which the reading frame, if it exists, is known; a sequence for
which the reading frame, if it exists, is unknown; a sequence which includes a coding
portion; a sequence which includes a non-coding portion; or a sequence from a
multiprotein database.

Methods of the invention allow a wide variety of information to be extracted
from nucleic acid sequences and allow a wide variety of useful manipulations, e.g., the
identification of useful protein structures and the design of improved or altered function
protein structures. These include, but are not limited to:

providing a protein structure encoded by codons of a first subcode which has a
predetermined property of a protein structure encoded by codons of a second class or
subcode. This allows: provision of a protein structure having a novel amino acid
sequence but which has a desired property, e.g., secondary structure, of a known protein;
provision of protein structure with improved or altered function;

identifying regions of proteins which are encoded by runs of codons of a single
class or subcode, thereby identifying regions which have been resistant to evolutionary
or mutational change and which may therefore be functionally important;

identifying a critical amino acid residue(s) in a protein structure by identifying
"minority codons" in runs encoded by codons of a single class or subcode, thereby
identifying amino acid replacements which, although disfavored at the mRNA level,
exhibit sufficiently favored characteristics at the protein level that they have been
maintained and may therefore be functionally important;

determination of nearest neighbor relationships based upon nearest neighbors
encoded by codons drawn from the same class or subcode;

distinguishing a coding region from a non-coding region by determining whether
the region obeys nearest neighbor relationships involving codons drawn from the same
class or subcode;

assignment of function (or structure) to a protein or polypeptide of unknown
structure by recognizing codon patterns in message-level nucleic acid which encodes the
protein or polypeptide structure of unknown function (e.g., the protein or polypeptide is
encoded in a first subcode) similar to codon patterns in message-level nucleic acid which
encodes the structure of known function (but different primary sequence) (e.g., which is
encoded by a second subcode).

As used herein, "protein structure" refers to a structure of at least two amino
acids linked by a peptide bond. A protein structure can include an entire protein, or a
part thereof. For example, a protein structure can include a domain or other region
having a characteristic structural, chemical, or biological property. Examples of
structural elements include helices, turns, sheets, helix-turn structures, tertiary amino acid
structure, and the like. Examples of chemical properties include net charge, side chain
bulk, side chain charge, acidity, nucleophilicity, hydrophobicity, and the like. Examples
of biological properties include catalytic activity, promoter or suppressor activity, ability
to bind to or interact with a second molecule such as DNA, RNA, a protein, or a metal
atom, immunological activity, and the like. Examples of known domains which can be
included in protein structures include: zinc fingers, binding regions, and the like. A
protein structural element can be from a naturally occurring protein or can be a
non-naturally occurring (e.g., a novel) construct. The protein structure can be of a
predetermined length. In preferred embodiments it is at least 8, 16, 32, 64 or 128 amino
acids in length.

As used herein, a predetermined property is a property other than the sequence of
amino acids, and can include one or more of the following: (1) three dimensional
structure, e.g., secondary structure, tertiary structure, or quaternary structure; (2) a
charge-related property, e.g., due to positively or negatively charged side chain residues,
including, but not limited to: the presence of a predetermined charge at a predetermined
location in the sequence, the net charge on a protein or polypeptide, and the like; (3)
hydrophobicity, e.g., due to the presence of water-insoluble side-chain residues; (4) an
activity associated with an intramolecular interaction or an intermolecular interaction.
Intermolecular interactions include binding activity, catalytic activity, and the like.

An "amino acid alphabet," as used herein, refers to a group of codons which
encode amino acids or stop codons.

As used herein, a "binary choice" amino acid alphabet of n degrees of freedom
refers to an amino acid alphabet which is structured into 2^n subcodes by the application
of binary choices dictated by n choice parameters, and where a choice parameter is a
function of nucleic acid sequence and/or codon patterns of the nucleic acid (e.g., an
mRNA).

A "binary choice parameter" or "opposition," as used herein, refers to a
parameter by which a polynucleotide codon or triplet can be assigned one of two values.
The assigned values allow the triplets to be assigned to classes. It will be appreciated
that application of more than one non-degenerate binary choice parameter can divide
triplets into more than two classes. The division into classes can be based on a
predetermined value, e.g., all triplets with a value less than the predetermined value are
in one class and all with values above the predetermined value are in a second class, or
all triplets having a predetermined characteristic, e.g., being pyrimidine-rich, are in a
first class and all codons being pyrimidine-poor are in a second class.

The term "coding modality," as used herein, refers to a pattern of codon usage in
a nucleic acid message, e.g., the frequency that one or more codons appears in a nucleic
acid sequence, the relative frequency that one or more codons appears in two or more
reading frames of a nucleic acid message, and the like.

A "triplet", as used herein, refers to three contiguous (sequential) nucleic acid
residues (e.g., read in the 5'-3' direction along the nucleic acid strand). A triplet can be a
codon (e.g., when a coding nucleic acid sequence is read in the coding frame) or can be a
non-reading frame triplet or non-coding triplet.

A leading triplet, as used herein, refers to a triplet which is 5' to the most 3' base
in the subject triplet. Thus, in a sequence 12345, the leading triplet is 123.

A final triplet, as used herein, refers to a triplet which is 3' to the most 5' base in a
subject triplet. Thus, in a sequence 12345, the final triplet is 345.

A class of triplets, as used herein, refers to all triplets which fall within a
particular subgroup of triplets under a selected binary choice alphabet.

A message-level property, as used herein, refers to a property of a nucleic acid
(e.g., mRNA) of three or more bases in length, which property is other than the identity
of or physical or chemical property of an amino acid (or punctuation) encoded by the
nucleic acid (wherein such physical and chemical characteristics include, e.g., size,
hydrophobicity, hydrophilicity), and is other than codon bias. Structural message-level
properties include physical and energetic properties of the nucleic acid. Examples
include: UA-rich triplets vs. CG-rich triplets; UG-rich triplets vs. AC-rich triplets;
purine-rich ("R-rich") triplets vs. pyrimidine-rich ("Y-rich") triplets; assigning a
plurality of codons in said sequence to (1) either a Y-rich subcode or an R-rich subcode
and (2) to either a K-rich (UG-rich) subcode or an M-rich (AC-rich) subcode.
Compositional message-level properties include frequencies of particular codon groups
in one or more reading frames of a message.

The term "reading frame" is known in the art and refers to a frame for reading,
e.g., translating, a nucleic acid message. For example, a sequence of nucleotides
123456789 can be read in three reading frames (e.g., in groups of three nucleotides, each
triplet being a codon): Reading Frame 1: 123 456 789; Reading Frame 2: 234 567; or
Reading Frame 3: 345 678.
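
The three reading frames in the example above can be reproduced with a short sketch. The function name reading_frames is illustrative, and incomplete trailing triplets are dropped, which matches the example given (frames 2 and 3 of a nine-base sequence each contain two triplets).

```python
def reading_frames(sequence):
    """Split a sequence into complete triplets for reading frames 1, 2 and 3."""
    return {
        frame + 1: [sequence[i:i + 3] for i in range(frame, len(sequence) - 2, 3)]
        for frame in range(3)
    }
```

Calling reading_frames("123456789") returns the triplets 123 456 789 for frame 1, 234 567 for frame 2, and 345 678 for frame 3.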
"Evaluating protein structure," as used herein, refers to determining properties of
a protein or polypeptide. For example, evaluating protein structure includes:
determining the three-dimensional structure of a protein or polypeptide; comparing the
three-dimensional structure of a known protein or polypeptide with that of an unknown
protein or polypeptide; determining the function of a protein or polypeptide; comparing
the function of a known protein or polypeptide with that of an unknown protein or
polypeptide; and the like.

Other features and advantages of the invention will be apparent from the
following detailed description, and from the claims.

Brief Description of the Drawings 

Figure 1 schematically depicts alternate reading frames for a nucleic acid 
message. 

Figure 2 depicts the distinction between "wildcard" and "constant" codon
doublets.

Figure 3 shows the 64 codons divided into four groups based on the "wildcard"
and "constant" distinction and the leading base of the codon.

Figure 4 shows the frequencies of codons in the groups of Figure 1 in a test
mRNA database.

Detailed Description 

In general, the invention features a method of evaluating protein structure. The
method includes:

providing a nucleic acid sequence which encodes the protein structure;

assorting bases of the nucleic acid sequence into subject triplets; and

assigning one or a plurality of subject triplets to one of a plurality of classes,
wherein the assignment is a function of classifying triplets, e.g., a subject triplet or a
leading and following triplet of the subject triplet, of the nucleic acid sequence as
members of a class of a binary choice alphabet of n degrees of freedom, and wherein the
classes can be generated by applying n binary choice parameters to a triplet to yield at
least 2^n classes of subject triplets, wherein a binary choice parameter is a function of a
message-level property of the nucleic acid sequence,

thereby evaluating the protein structure. Triplets can be assigned to a class
based on whether they satisfy a value for the message level property, e.g., a triplet can be
assigned to a class based on whether its value for a parameter is above or below a
predefined value, or whether or not it possesses a particular characteristic, e.g., whether it
is GC rich.

The message-level property is other than the identity of the amino acid or
punctuation which a triplet encodes and is other than codon bias.

In preferred embodiments the method includes making a record, e.g., on a
machine readable medium, of the class assigned to one or more triplets.

In preferred embodiments: n is chosen from the integers 1, 2, 3, and 4.

In preferred embodiments the message-level property is a function of a physical
or chemical property of one or more bases of a nucleic acid; is a function of a physical or
chemical property which affects the tendency of a nucleic acid to form secondary
structure.

In preferred embodiments triplets are assigned to a first and a second class:

the first class having the property that a message made of triplets drawn
exclusively from the first class is less likely to form secondary (intrachain) structure than
is a message which is made of triplets from both the first class and the second class of
triplets, and

the second class having the property that a message made of triplets drawn
exclusively from the second class is less likely to form secondary (intrachain) structure
than is a message which is made of triplets from both the first class and the second class
of triplets.

In preferred embodiments the message-level property is: a function of the UA
content of a subject triplet; a function of the GC content of a subject triplet; a function of
the size or molecular weight of a triplet; a function of whether the triplet is keto rich or
amino rich; a function of whether the triplet is purine rich or pyrimidine rich; a function
of the enthalpy of the interaction between the triplet and a fully or partially
complementary nucleic acid.



In preferred embodiments: the binary choice parameter is applied to the subject
triplet, e.g., applied to the codon which encodes an amino acid, to place a subject triplet
in a class.

In preferred embodiments: the class into which a subject triplet is assigned is a
function of:

(1) providing a value for a subject triplet of bases 456, wherein the value is a
function of the application of a binary choice parameter to a first set of contiguous bases
which includes all or a subset of the bases of the subject triplet, e.g., bases 4 and 5, and of
the application of a binary choice parameter to a second, different, set of contiguous bases
which includes all or a subset of the bases of the subject triplet, e.g., bases 5 and 6; and

(2) assigning a plurality of subject triplets to a first class, and a plurality of
triplets to a second class, as a function of subject triplet value.

In preferred embodiments: the class into which a subject triplet is assigned is a
function of:

(1) providing a value for a subject triplet of bases 456, wherein the value is a
function of the application of a binary choice parameter to a first subset of the bases of
the subject triplet, e.g., 4 and 5, and of the application of a binary choice parameter to a
second, different subset of the bases of the subject triplet; and

(2) assigning a plurality of subject triplets to a first class, and a plurality of
triplets to a second class, as a function of subject triplet value.

In preferred embodiments: the class into which a subject triplet is assigned is a
function of:

(1) providing a value for a subject triplet of bases 456, wherein the value is a
function of (S1 + S2)/2, wherein S1 is a function of the application of a binary choice
parameter (e.g., the value for enthalpy of anticodon-codon formation above or below a
predetermined value) to a first subset of the bases of the subject triplet, e.g., bases 4 and
5 of the subject triplet, and S2 is a function of the application of a binary choice
parameter to a second, different, subset of the bases of the subject triplet, e.g., bases 5
and 6 of the subject triplet; and

(2) assigning a plurality of subject triplets to a first class, and a plurality of
triplets to a second class.

In preferred embodiments: the class into which a subject triplet is assigned is a
function of the application of a binary choice parameter to one or both of a leading
triplet or a final triplet of the subject triplet.

In preferred embodiments: the class into which a subject triplet is assigned is a
function of:

(1) providing a value, e.g., enthalpy, of a triplet of bases 456, wherein the value
is a function of (S1 + S2)/2, wherein S1 is the value, e.g., enthalpy, of the base pair
doublet 45 of the subject triplet, and S2 is the value, e.g., enthalpy, of the base pair
doublet 56 of the subject triplet; and

(2) assigning a plurality of subject triplets to a first class, e.g., a low enthalpy
class, and a plurality of triplets to a second class, e.g., a high enthalpy class.
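
By way of illustration only, the doublet-averaging step described above might be sketched in Python as follows. The names gc_count, triplet_value and assign_class, and the cutoff of 1.0, are assumptions introduced for this sketch. In particular, counting G and C bases is only a stand-in for a measured doublet enthalpy, which is not given here.

```python
from typing import Callable


def gc_count(doublet: str) -> float:
    """Toy doublet score (an assumption): number of G or C bases in the doublet."""
    return float(sum(base in "GC" for base in doublet))


def triplet_value(triplet: str, doublet_score: Callable[[str], float]) -> float:
    """Value of a triplet 456 as (S1 + S2)/2.

    S1 is the score of doublet 45 and S2 is the score of doublet 56.
    """
    s1 = doublet_score(triplet[0:2])
    s2 = doublet_score(triplet[1:3])
    return (s1 + s2) / 2


def assign_class(triplet: str, doublet_score: Callable[[str], float], cutoff: float) -> str:
    """Place a triplet in a 'high' or 'low' class relative to a chosen cutoff."""
    return "high" if triplet_value(triplet, doublet_score) > cutoff else "low"


for t in ("GGC", "ACG", "GAU", "AUA"):
    print(t, triplet_value(t, gc_count), assign_class(t, gc_count, cutoff=1.0))
```

Any other doublet scoring function, e.g., one returning tabulated enthalpy values, could be passed in place of gc_count without changing the averaging step.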

In preferred embodiments: a subject triplet 456 of a nucleic acid sequence of
bases 123 456 789 is assigned into a class as a function of:

(1) performing one or more of (i), (ii), and (iii):

(i) applying a binary choice parameter to a leading triplet of 456, e.g., to one or
more of triplet 123, 234, or 345, to yield a leading value;

(ii) applying a binary choice parameter to 456, to provide a center value;

(iii) applying a binary choice parameter to a following triplet of 456, e.g., to one
or more of triplet 567, 678, or 789, to yield a following value; and

(2) assigning one or a plurality of subject triplets 456 into a class based on the
values determined in one or more of (i), (ii), and (iii),

thereby assigning one or a plurality of subject triplets into classes.

In preferred embodiments: the class into which a subject triplet is assigned is a
function of the application of a first binary choice parameter to a leading triplet and a
second binary choice parameter to a following triplet of a subject triplet.

In preferred embodiments: the evaluation includes determining if the nucleic
acid sequence includes a run of triplets, e.g., a run at least 20, 40, 60, or 120 triplets in
length, in which at least 20, 40, 60, 80, 90 or 95 %, or all, of the triplets in the run are
from a first class. The method allows for evaluating a protein structure for resistance to
change, e.g., evolutionary or mutational change, by identifying regions of the protein
structure which are encoded by a run of a single class or subcode, thereby identifying
regions which have been resistant to change and which are therefore predicted to be
functionally or structurally significant. In preferred embodiments a codon, preferably
within the run, is changed so as to alter the sequence of the encoded amino acid to
provide an altered sequence.

In preferred embodiments: the evaluation comprises identifying a triplet from a
first class in a run of triplets of a second class, e.g., a run at least 20, 40, or 60 codons in
length, in which at least 20, 40, 80, 90 or 95 %, or all, of the codons are from the second
class, thereby identifying the triplet of the first class as encoding a critical residue, e.g., a
structure or function critical residue. In a preferred embodiment a codon is changed so
as to alter the amino acid encoded by the critical residue, a residue adjacent to the critical
residue, or a residue which interacts with the critical residue, and thereby provide an
altered sequence.
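
The run-based evaluations described above might be sketched in Python as follows, assuming that each triplet has already been assigned a one-letter class label. The function names, the fixed-length sliding window, and the toy input are assumptions of this sketch rather than features taken from the text.

```python
from typing import List, Sequence


def windows_dominated_by(classes: Sequence[str], target: str,
                         length: int, min_fraction: float) -> List[int]:
    """Start indices of every window of `length` triplets in which at least
    `min_fraction` of the triplets belong to the `target` class."""
    starts = []
    for start in range(0, len(classes) - length + 1):
        window = classes[start:start + length]
        if sum(c == target for c in window) / length >= min_fraction:
            starts.append(start)
    return starts


def minority_positions(classes: Sequence[str], start: int,
                       length: int, majority: str) -> List[int]:
    """Indices inside a run whose triplet is NOT from the majority class;
    these are the candidate 'critical residue' positions."""
    return [i for i in range(start, start + length) if classes[i] != majority]


# Toy input: one class-A triplet inside a run of class-B triplets.
labels = list("BBBBBABBBB")
runs = windows_dominated_by(labels, "B", length=10, min_fraction=0.9)
print(runs)
print(minority_positions(labels, runs[0], 10, "B"))
```

In practice the window length and fraction would be chosen from the values recited above, e.g., a run of at least 20 triplets of which at least 90 % are from one class.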

In preferred embodiments: the nucleic acid encodes a protein structure of known
or unknown function.

In another aspect, the invention features a class-constant table of nearest
neighbor relationships for amino acid residues which provides, for each of a plurality of
class constant nearest neighbors, a frequency of occurrence which is a function of the
occurrence of the class constant nearest neighbor pair in a collection of protein
structures, e.g., a collection of at least 10, 50, 100, or 500 proteins.

The class constant table provides a measure of the frequency with which a first
and a second amino acid occur as nearest neighbors, wherein nearest neighbor
frequencies are determined within a codon class, and wherein a class is a function of a
message level property of a nucleic acid, e.g., the codon, which encodes an amino acid.
The class can be any class generated by the binary choice parameter-based methods
referred to herein. For example, if the classes are a first class, e.g., high enthalpy codons,
and a second class, e.g., low enthalpy codons, the table is generated for nearest
neighbors where both neighbors are encoded by codons of either the first class or codons
of the second class.

In preferred embodiments: the assignment of amino acids into a class is done by
assigning a codon which encodes it into a class as a function of classifying triplets, e.g.,
the subject codon or a leading and following triplet of the subject codon, as a member of
a binary choice alphabet of n degrees of freedom by applying n binary choice
parameters to a triplet to yield at least 2^n classes of triplets, wherein a binary choice
parameter is a function of a message-level property of the nucleic acid sequence.

The table can be recorded on a machine readable medium. 
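
One possible way of tabulating class-constant nearest neighbor frequencies is sketched below in Python. The sketch assumes that "nearest neighbors" are residues adjacent in the primary sequence and that each residue is supplied together with the class of its codon. Both assumptions, as well as the function name and the toy data, are introduced for illustration only.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

Residue = Tuple[str, str]  # (amino acid, class of the codon which encodes it)


def class_constant_table(proteins: Iterable[List[Residue]]) -> Dict[str, Dict[Tuple[str, str], float]]:
    """For each codon class, the relative frequency of each adjacent amino
    acid pair in which BOTH residues are encoded by codons of that class."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for residues in proteins:
        for (aa1, cls1), (aa2, cls2) in zip(residues, residues[1:]):
            if cls1 == cls2:  # keep only class-constant pairs
                counts[cls1][(aa1, aa2)] += 1
    table = {}
    for cls, pair_counts in counts.items():
        total = sum(pair_counts.values())
        table[cls] = {pair: n / total for pair, n in pair_counts.items()}
    return table


# Toy collection of one short protein (hypothetical class labels).
toy = [[("M", "high"), ("A", "high"), ("A", "high"), ("K", "low"), ("L", "low")]]
table = class_constant_table(toy)
print(table)
```

The mixed-class pair (A, K) in the toy data is skipped, since only pairs encoded within a single class contribute to the table.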

In another aspect the invention features a method of evaluating a protein
structure. The method includes: providing a class-constant table of nearest neighbor
relationships for amino acid residues;

providing a nucleic acid which encodes a protein structure; and

comparing one or a plurality of the observed nearest neighbor pairs in the protein
structure with the frequencies provided by the class constant table, thereby evaluating
the protein structure.

The class constant table provides a measure of the frequency with which a first
and a second amino acid occur as nearest neighbors, wherein nearest neighbor
frequencies are determined within a codon class, and wherein a class is a function of a
message level property of a nucleic acid, e.g., the codon, which encodes an amino acid.
The class can be any class generated by the binary choice parameter-based methods
referred to herein. For example, if the classes are a first class, e.g., high enthalpy codons,
and a second class, e.g., low enthalpy codons, the table is generated for nearest
neighbors where both neighbors are encoded by codons of either the first class or codons
of the second class.

In preferred embodiments: the comparison can include: assigning an expected
frequency from the class constant table to one or a plurality of the observed nearest
neighbor pairs and determining how many of the observed nearest neighbor pairs fall
above or below a predetermined value; determining the likelihood of occurrence, as
predicted by the class constant table, for an observed nearest neighbor pair; or
determining if an observed nearest neighbor pair of a first and a second amino acid
residue from the protein structure is predicted by the class constant table to occur at a
predetermined frequency.

In preferred embodiments: the assignment of amino acids into a class is done by
assigning a codon which encodes it into a class as a function of classifying triplets, e.g.,
the subject codon or a leading and following triplet of the subject codon, as a member of
a binary choice alphabet of n degrees of freedom by applying n binary choice
parameters to a triplet to yield at least 2^n classes of triplets, wherein a binary choice
parameter is a function of a message-level property of the nucleic acid sequence.

In preferred embodiments the method includes making a record of observed 
class constant nearest neighbors in the protein structure on a machine-readable medium. 



In preferred embodiments: the method further includes determining if an
observed nearest neighbor of the protein structure is that predicted, at a predetermined
frequency, by the table, thereby evaluating the protein structure.

In preferred embodiments the method can be used to identify coding regions in a
nucleic acid sequence. A coding region can be identified by comparing observed nearest
neighbors in the protein structure with a class constant nearest neighbor table, the
presence of observed pairs which correspond to predicted pairs in the table being
predictive of a coding region. In a preferred embodiment, a codon in the coding region
is changed so as to alter its encoded amino acid.

In preferred embodiments the method can identify structure or function critical
residues, the occurrence of a nearest neighbor of low probability being predictive of a
critical amino acid residue. In a preferred embodiment, a codon is changed so as to alter
the amino acid encoded by the critical residue, a residue adjacent to the critical residue,
or a residue which interacts with the critical residue.

In preferred embodiments: the protein structure is from a protein of known or
unknown function.

In preferred embodiments: the protein structure is evaluated for the presence of a
first nearest neighbor with a predicted occurrence below a predetermined value which is
located in a run of residues, wherein at least 20, 40, 80, 90 or 95 % of the residues in the
run are members of nearest neighbor pairs having an expected frequency from the table
of greater than a predetermined value, thereby identifying a critical residue. In a
preferred embodiment, a codon is changed so as to alter the amino acid encoded by the
critical residue, a residue adjacent to the critical residue, or a residue which interacts
with the critical residue.

In preferred embodiments: the nearest neighbor includes or is adjacent to a 
critical residue. 

In another aspect, the invention includes a machine-readable medium on which
is recorded a class-constant nearest neighbor table.

In another aspect, the invention features a method of evaluating a protein
structure for resistance to change, e.g., evolutionary or mutational change. The method
includes:

identifying regions of a protein which are encoded by runs of a single subcode,
thereby identifying regions which have been resistant to change and which are therefore
predicted to be functionally or structurally significant. E.g., the method can include
determining if the nucleic acid sequence which encodes the protein structure includes a
run of triplets, e.g., a run at least 20, 40, 60, or 120 triplets in length (or, e.g., 16, 32, 48,
64, 128, or 256 triplets in length), in which at least 20, 40, 60, 80, 90 or 95 %, or all, of
the triplets in the run are from one class. Any of the ways of generating classes
described herein can be used in this method.

In another aspect, the invention includes a method of evaluating a protein
structure for the presence of critical amino acid residues. The method includes:

identifying critical amino acid residues by identifying "minority codons" in runs
encoded by codons of a single class or subcode, thereby identifying residues which have
been resistant to change and which are therefore believed to be functionally important.
Any of the ways of generating classes described herein can be used in this method.

In preferred embodiments: the evaluation comprises identifying a triplet from a
first class in a run of triplets of a second class, e.g., a run at least 20, 40, or 60 codons
in length, in which at least 20, 40, 60, 80, 90 or 95 %, or all, of the codons are from the
second class, thereby identifying the triplet of the first class as encoding a critical
residue, e.g., a structure or function critical residue. In a preferred embodiment, a codon
is changed so as to alter the amino acid encoded by the critical residue, a residue
adjacent to the critical residue, or a residue which interacts with the critical residue.

In another aspect, the invention features a method for evaluating a protein
structure. The method includes:

providing a nucleic acid sequence which encodes the protein structure;

assorting bases of the nucleic acid sequence into subject triplets; and

assigning at least one of the subject triplets to one of a plurality of classes,
wherein the assignment is a function of classifying the subject triplets of the nucleic acid
sequence under a binary choice alphabet of n degrees of freedom by applying n binary
choice parameters to a triplet to yield at least 2^n classes of subject triplets, wherein the
assignment provides at least four classes of triplets, the at least four classes of triplets
being represented in at least a portion of the nucleic acid sequence in a ratio of about
3:5:3:5;

thereby evaluating the protein structure.



In preferred embodiments n is 1, 2, 3, or 4.

In preferred embodiments the method includes making a record, e.g., on a
machine-readable medium, of the class assigned to one or more triplets.

In preferred embodiments, the classes can be generated by application of a binary
choice parameter referred to herein.

In another aspect, the invention features a method for identifying coding regions
of a nucleic acid sequence, the method comprising:

providing the nucleic acid sequence;

assorting bases of at least a portion of the nucleic acid sequence into a plurality
of subject triplets;

assigning the plurality of subject triplets to one of a plurality of classes, wherein
the assignment is a function of classifying the subject triplets of the nucleic acid
sequence under a binary choice alphabet of n degrees of freedom by applying n binary
choice parameters to a triplet to yield at least 2^n classes of subject triplets, wherein the
assignment provides at least four classes of triplets A, B, C, and D;

determining whether the plurality of subject triplets are distributed into the at
least four classes of triplets A:B:C:D in a ratio of about 3:5:3:5;

thereby identifying coding regions of the nucleic acid sequence.

In preferred embodiments n is 1, 2, 3, or 4.

In preferred embodiments (A+B)/(C+D) is about one.

In preferred embodiments (A+D)/(B+C) is about one.

In preferred embodiments the method includes making a record, e.g., on a
machine-readable medium, of the class assigned to one or more triplets.

In preferred embodiments, the classes can be generated by application of a binary
choice parameter referred to herein.

In another aspect, the invention features a method for identifying a protein that
includes a polypeptide portion which is structurally or functionally similar to all or a
portion of a test protein, the method comprising:

providing a nucleic acid sequence which encodes all or a portion of the test
protein;

assorting bases of at least a portion of the nucleic acid sequence into a plurality
of subject triplets in a first reading frame;

assigning the plurality of subject triplets in the first reading frame to one of a
plurality of classes, wherein the assignment is a function of classifying the subject
triplets of the nucleic acid sequence under a first binary choice alphabet of n degrees of
freedom by applying n first binary choice parameters to a triplet to yield at least 2^n
classes of subject triplets, wherein the assignment provides at least four classes of
triplets distributed in a ratio of about 3:5:3:5;

assorting bases of the at least a portion of the nucleic acid sequence into a
plurality of subject triplets in a second reading frame;

assigning the plurality of subject triplets in the second reading frame to one of a
plurality of classes, wherein the assignment is a function of classifying the subject
triplets of the nucleic acid sequence under a second binary choice alphabet of n degrees
of freedom by applying n second binary choice parameters to a triplet to yield at least 2^n
classes of subject triplets, wherein the assignment provides at least four classes of
triplets distributed in a ratio of about 3:5:3:5; and

identifying a protein which includes a polypeptide portion encoded by the
plurality of triplets in the second reading frame;

thereby identifying a protein that includes a polypeptide portion which is
structurally or functionally similar to all or a portion of the test protein.

In preferred embodiments, for each of the first and second binary choice alphabets, n
is 1, 2, 3, or 4, e.g., n is two.

In preferred embodiments (A+B)/(C+D) is about one.

In preferred embodiments (A+D)/(B+C) is about one.

In preferred embodiments the first reading frame is frame 1 and the second
reading frame is frame 2 or 3.

In preferred embodiments the method includes making a record, e.g., on a
machine-readable medium, of the class assigned to one or more triplets.

In preferred embodiments, the classes can be generated by application of a binary
choice parameter referred to herein.



In preferred embodiments the step of identifying a protein which includes a
polypeptide portion encoded by the plurality of triplets in the second reading frame
comprises reading all or a portion of a protein sequence from a database of protein
sequences.

In another aspect, the invention features a method for identifying a mutation-prone
region of a nucleic acid sequence, e.g., a viral nucleic acid sequence. The method
includes:

providing the nucleic acid sequence;

assorting bases of at least a portion of the nucleic acid sequence into a plurality
of subject triplets in a first reading frame;

assigning the plurality of subject triplets in the first reading frame to one of a
plurality of classes, wherein the assignment is a function of classifying the subject
triplets of the nucleic acid sequence under a binary choice alphabet of n degrees of
freedom by applying n binary choice parameters to a triplet to yield at least 2^n classes of
subject triplets, wherein the assignment provides at least four classes of triplets
distributed in a ratio of about 3:5:3:5;

assorting bases of the at least a portion of the nucleic acid sequence into a
plurality of subject triplets in a second reading frame; and

assigning the plurality of subject triplets in the second reading frame to one of a
plurality of classes, wherein the assignment is a function of classifying the subject
triplets of the nucleic acid sequence under a binary choice alphabet of n degrees of
freedom by applying n binary choice parameters to a triplet to yield at least 2^n classes of
subject triplets, wherein the assignment provides at least four classes of triplets
distributed in a ratio of about 3:5:3:5;

thereby identifying a mutation-prone region of the nucleic acid sequence.

In preferred embodiments the method includes making a record, e.g., on a
machine-readable medium, of the class assigned to one or more triplets.

In preferred embodiments, the classes can be generated by application of a binary
choice parameter referred to herein.

Structural binary choice parameters can be selected from a variety of physical or
physico-chemical qualities related to the structure of the nucleic acid (polynucleotide)
sequence, including the primary or secondary structure of the nucleic acid sequence, the
physical or chemical nature of the nucleotide bases, the physical or chemical nature of
the codons, and the like. Thus, for example, properties related to the ability of a nucleic
acid sequence to form secondary structure, e.g., by hybridization of subsequences of the
nucleic acid sequence, can be selected as binary choice parameters. For example, the
self-pairing of a nucleic acid sequence could be greater in, e.g., a highly UA (or GC)-rich
region of the nucleic acid, while a nucleic acid which is not UA (or GC)-rich would
be less prone to self-pairing.

Other exemplary binary choice parameters include the size of the nucleotide
bases (e.g., pyrimidine vs. purine), H-bonding qualities due to H-bond donor or acceptor
substituents (e.g., amino- vs. keto-containing nucleotide bases), and the like.

Binary choice parameters can also be related to selected properties of codons,
including the relative enthalpy of codon-anticodon interactions (which can include the
relative enthalpy of the interaction of a codon with its anticodon plus the flanking
complementary bases, e.g., the relative enthalpy of pentamers with their antiparallel
complements); the ability of a codon to be "read" by a tRNA (which can be related to
codon-anticodon interaction enthalpy, size, polarity, and the like); and other such
codon-level parameters. However, a codon-level parameter is not a function of the amino
acid encoded by the codon.

Compositional Binary Choice Parameters

Compositional binary choice parameters can be selected from observed
frequencies of certain codon groups in one message reading frame and/or correlations
among frequencies of particular codon groups in two different reading frames of the
same message. Compositional choice parameters include those derived from enthalpic
and statistical analysis of mRNA pentamers; compositional choice parameters also
include any derived from energetic and statistical analysis of mRNA n-mers (i.e., n > 3),
where such analyses can be shown to yield constant intra- and inter-frame frequencies of
particular codon groups.

Application of Binary Choice Parameters

The application of a first binary choice parameter, e.g., with choices a and b, will
structure the triplets into classes (or subcodes) a and b. The application of a second
choice parameter, with choices c and d, will structure the triplets into classes ac, ad, bc,
and bd. Application of a third binary choice parameter would structure triplets into 2^3, or
eight, subcodes. Thus, application of n binary choice parameters to the genetic code will
result in the formation of a binary choice alphabet having 2^n classes or subcodes. It is
possible that some subcodes will be empty when the binary choice alphabet is applied to
a given nucleic acid sequence.
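
The way in which n binary choice parameters generate up to 2^n subcodes might be sketched in Python as follows. The two parameters used here (purine rich vs. pyrimidine rich, and keto rich vs. amino rich, each decided by a simple majority of the three bases) and all function names are assumptions chosen for illustration. They are not put forward as the preferred parameters.

```python
from collections import defaultdict
from itertools import product
from typing import Callable, Dict, List, Sequence


def purine_rich(codon: str) -> str:
    """'R' if at least two bases are purines (A, G), else 'Y'. Majority rule is an assumption."""
    return "R" if sum(b in "AG" for b in codon) >= 2 else "Y"


def keto_rich(codon: str) -> str:
    """'k' if at least two bases are keto bases (G, U), else 'a'. Majority rule is an assumption."""
    return "k" if sum(b in "GU" for b in codon) >= 2 else "a"


def build_alphabet(parameters: Sequence[Callable[[str], str]]) -> Dict[str, List[str]]:
    """Apply every binary choice parameter to each of the 64 triplets and
    group the triplets by the resulting label (one letter per parameter)."""
    classes: Dict[str, List[str]] = defaultdict(list)
    for bases in product("UCAG", repeat=3):
        codon = "".join(bases)
        label = "".join(p(codon) for p in parameters)
        classes[label].append(codon)
    return dict(classes)


alphabet = build_alphabet([purine_rich, keto_rich])
for label in sorted(alphabet):
    print(label, len(alphabet[label]))
```

With two parameters the sketch yields four subcodes; adding a third function to the list would yield up to eight, some of which may be empty for a given choice of parameters.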

The binary choice parameter can be applied directly to a subject triplet to assign
triplets into a class. For example, the binary choice parameter can be based upon
relative enthalpy of a codon-anticodon interaction (e.g., the codons are divided into
group(s) of codons having high relative enthalpy and group(s) of codons having low
relative enthalpy) and that parameter applied to a subject codon such as 234. A subject
triplet can also be assigned a class by a method which uses bases that are not in the
subject triplet, or which do not correspond exactly to the bases of the subject triplet. E.g.,
the binary choice parameter can be applied to one or more base pairs which do not define
the triplet. E.g., in evaluation of triplet 234, the binary choice parameter can be applied to
triplet 123 and triplet 345, and the classes into which the triplets 123 and 345 fall can be
used to assign a class or subcode to the triplet 234. In other words, the subcode of 234
can be a function of the application of the binary choice parameter to the triplets 123 and
345.

Frame Choice 

Methods of the invention require the division of a sequence of bases into triplets.
The simplest way is to consider a string of bases, 123456789, as triplets of 123 456 789.
Mechanistically, this or any mode of division into triplets can be viewed as a process
with two components, a "ratchet" or advance component and a "read" or selection
component. As will be seen below, the ratchet component varies by the number of base
pairs advanced after the determination of a triplet.

The read component refers to the length, in base pairs, of the segment of base
pairs from which the triplet will be chosen.

The simplest system, that used by most evolutionarily current cellular
mechanisms, is "ratchet 3/read 3" (that is, the mRNA is advanced, or ratcheted,
through the reading mechanism three bases at a time, and the message is read by the
reading mechanism in groups of three bases (one codon)). Other systems, however, are
possible. Without being bound by theory, it is postulated that other systems may have
existed in earlier stages in the evolution of the cellular protein translation machinery. In
fact, examples of current frame-shift repressing tRNAs are known. Thus, possible
alternate systems include "ratchet 3/read 5 on center" (in which the mRNA is ratcheted
into the reading mechanism three bases at a time, and the reading mechanism reads the
group of three bases at the center of a group of five bases in the reading mechanism). If
a read value is more than 3 (e.g., in a "ratchet 3/read 5 on center" system), then
additional choices are imposed: the triplet must be selected from the 3+N bases which
are read. Thus, a string 1 2 3 4 5 6 7 8 9 10 can be divided into the following triplets:
234 | 567 | 8 9 10, which would be generated by reads 12345 | 45678 | 7 8 9 10 11, wherein
the on-center bases, i.e., the second through fourth bases of each read, are chosen.
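
A "ratchet 3/read 5 on center" division of a string of bases might be sketched in Python as follows. The function name and its default arguments are assumptions of this sketch. Positions are passed as a list of labels so that the numbered example above can be reproduced, and an eleventh position is included so that the third read is complete.

```python
from typing import List, Sequence, Tuple


def center_triplets(bases: Sequence[str], ratchet: int = 3, read: int = 5) -> List[Tuple[str, ...]]:
    """Advance `ratchet` bases at a time, read `read` bases, and keep the
    three bases at the center of each read."""
    offset = (read - 3) // 2  # bases skipped on the leading side of each read
    triplets = []
    for start in range(0, len(bases) - read + 1, ratchet):
        window = bases[start:start + read]
        triplets.append(tuple(window[offset:offset + 3]))
    return triplets


positions = [str(i) for i in range(1, 12)]  # positions 1 through 11
print(center_triplets(positions))           # ratchet 3 / read 5 on center
print(center_triplets(positions, read=3))   # ratchet 3 / read 3
```

Setting read to 3 reduces the sketch to the simple "ratchet 3/read 3" division into 123, 456, 789.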

For example, read-ratchet mechanisms or configurations can be divided into the
following classes:

Class 1:  ratchet 3; read 3
Class 2:  ratchet 3; read 5 and select the center triplet (234 of 12345)
Class 2a: ratchet 3; read 5 and select the leading triplet (123 of 12345)
Class 2b: ratchet 3; read 5 and select the final triplet (345 of 12345)
Class 2c: ratchet 3; read 5 and read any triplet
Class 3:  ratchet 3; frameshift; read 5; any triplet

Class 2 approaches allow the assignment of a binary choice parameter to a codon
234 as a function of the binary choice parameter outcome for one or both of 123 and
345, e.g., 123 is classified as UG (k)-rich or AC (a)-rich, and 345 is classified as UG
(k)-rich or AC (a)-rich, which gives the following possible classes for 234: kk, ka, aa,
ak. If, for example, 123 is k, and 345 is a, then 234 is ka. Note that although only one
binary choice is applied, there are 4 degrees of freedom with regard to 234, because the
binary choice parameter is applied twice.
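
The context-sensitive assignment of a class to triplet 234 from its overlapping neighbors 123 and 345 might be sketched in Python as follows. The majority rule used to call a triplet UG (k)-rich or AC (a)-rich, the function names, and the example sequence are assumptions introduced for this sketch.

```python
def keto_or_amino(triplet: str) -> str:
    """'k' if at least two bases are U or G, otherwise 'a' (majority rule is an assumption)."""
    return "k" if sum(base in "GU" for base in triplet) >= 2 else "a"


def context_class(sequence: str, first: int) -> str:
    """Class of the subject triplet starting at 0-based index `first`,
    derived from the leading triplet (shifted one base upstream) and the
    following triplet (shifted one base downstream)."""
    if first < 1 or first + 4 > len(sequence):
        raise ValueError("subject triplet needs one flanking base on each side")
    leading = sequence[first - 1:first + 2]    # bases 123 when the subject is 234
    following = sequence[first + 1:first + 4]  # bases 345 when the subject is 234
    return keto_or_amino(leading) + keto_or_amino(following)


# Bases 1-5 are G U A C A: 123 = GUA (k-rich), 345 = ACA (a-rich), so 234 is "ka".
print(context_class("GUACA", 1))
```

Because the single parameter is applied twice, the subject triplet can fall into any of the four classes kk, ka, ak, or aa.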

A binary choice parameter which divides triplets into classes on the basis of
enthalpy, e.g., of the codon-anticodon interaction (e.g., into enthalpically strong and
enthalpically weak classes) is particularly useful.

Read-ratchet configurations wherein the read value is greater than 3 make
possible the context-sensitive (as opposed to context-free) assignment of triplets into
classes by binary choice parameters, e.g., allow triplet 234 to be assigned a value which
is a function of the binary choice parameter outcome of one and, more preferably both,
of 123 and 345.

Binary Choice Alphabets 

A binary choice alphabet can be constructed by selection of suitable pre-selected
binary choice parameters. For example, binary choice parameters corresponding to
enthalpy (of codon-anticodon interaction), size, polarity, charge, hydrophobicity, etc., can
be selected and combined in any desired combination to arrive at a binary choice
alphabet. Alternatively, a binary choice alphabet can be constructed by segregation of
codons into 2^n classes, without selecting the groups based on binary choice parameters.
For example, a computer can rapidly segregate codons into randomly-selected classes to
create a binary choice alphabet.

However the binary choice alphabet is constructed, it is generally preferable to
validate the alphabet to ensure that the alphabet will be predictive of protein structure. It
has been found that, in preferred embodiments, valid binary choice alphabets having 2^n
classes will generally include at least four classes (A, B, C, D) for which the following
relationships are true when triplets of a nucleic acid sequence are parsed with the binary
choice alphabet: the ratio A:B:C:D is about 3:5:3:5 (e.g., from about 2:4:2:4 to about
7:11:7:11); the ratio (A+B)/(C+D) is about 1 (e.g., from about 0.9 to about 1.1); and the
ratio (A+D)/(B+C) is about 1 (e.g., from about 0.9 to about 1.1). Thus, in preferred
embodiments, application of a binary choice alphabet to a nucleic acid sequence which
encodes a protein will yield at least four classes in which triplets are arrayed according
to these ratios. It is therefore possible to validate a binary choice alphabet by searching
for the appearance of the desired ratios. If the ratios are found, then the alphabet may
have predictive value for protein structure evaluation. If the ratios are not found, the
alphabet may not have such predictive value. It will be appreciated that the presence or
absence of the ratios provides a useful "check" for a selected binary choice alphabet. In
preferred embodiments, regions of triplets are checked for lengths which are multiples of
16 (e.g., 3+5+3+5=16), such as 16 triplets, 32 triplets, 64 triplets, and the like. Thus, in
a preferred embodiment, groups of N x 16 sequential triplets (wherein N is an integer,
e.g., between 1 and 16) are evaluated to determine whether the desired ratios are present.
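
The ratio check described above might be sketched in Python as follows for a window of triplets that have already been labeled A, B, C, or D. The function name, the tolerance applied to the individual class fractions (frac_tol), and the toy windows are assumptions of this sketch. The bounds of 0.9 and 1.1 on the two summed ratios follow the example ranges given above.

```python
from collections import Counter
from typing import Sequence

# Expected class fractions for a ratio A:B:C:D of 3:5:3:5 (sixteen parts in all).
EXPECTED = {"A": 3 / 16, "B": 5 / 16, "C": 3 / 16, "D": 5 / 16}


def passes_ratio_check(classes: Sequence[str], frac_tol: float = 0.05,
                       low: float = 0.9, high: float = 1.1) -> bool:
    """True if the window shows roughly 3:5:3:5, with (A+B)/(C+D) and
    (A+D)/(B+C) both close to one."""
    counts = Counter(classes)
    total = len(classes)
    if total == 0 or any(counts[k] == 0 for k in EXPECTED):
        return False  # a missing class cannot satisfy the ratios
    fractions_ok = all(abs(counts[k] / total - EXPECTED[k]) <= frac_tol for k in EXPECTED)
    ab_cd = (counts["A"] + counts["B"]) / (counts["C"] + counts["D"])
    ad_bc = (counts["A"] + counts["D"]) / (counts["B"] + counts["C"])
    return fractions_ok and low <= ab_cd <= high and low <= ad_bc <= high


window = list("AAABBBBBCCCDDDDD")  # 16 triplets in exactly 3:5:3:5
print(passes_ratio_check(window))
print(passes_ratio_check(list("AAAAAAAABBBBCCDD")))
```

The same function can be applied to successive groups of N x 16 triplets along a sequence to locate the regions in which the ratios hold.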

Another means for validating a binary choice alphabet is by comparing the
frequency of codon groups when the message is read in one reading frame (e.g., Frame
1) with the frequency of the same codon groups when the message is read in another
reading frame (e.g., Frame 3). It has also been found that valid binary choice alphabets
will generally include at least four classes A, B, C, D such that the frequency of codons
A, B, C, D varies systematically from one frame to another, e.g., from Frame 2 to Frame
1, Frame 2 to Frame 3, and/or Frame 1 to Frame 3. It is therefore possible to further
validate a binary choice alphabet by searching for a systematic inter-frame variation in
the frequencies of codon groups defined by the alphabet. If such systematic variation is
found, the alphabet may have predictive value. One of ordinary skill in the art, in light
of the teachings herein, will be able to select useful binary choice alphabets according to
these criteria using no more than routine experimentation.
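
The inter-frame comparison described above could be sketched, under stated assumptions, as follows. The helper names (`triplets`, `group_frequencies`, `frame_shift_delta`) and the `classify` callable, which stands in for whichever binary choice alphabet is being validated, are hypothetical. Frame k is assumed to start at 0-based offset k-1 of the message.

```python
from collections import Counter

def triplets(seq, frame):
    """Split seq into complete triplets; frame 1, 2 or 3 starts at offset 0, 1 or 2."""
    start = frame - 1
    return [seq[i:i + 3] for i in range(start, len(seq) - 2, 3)]

def group_frequencies(seq, frame, classify):
    """Relative frequency of each codon group in the given frame.
    `classify` is a caller-supplied (hypothetical) function: triplet -> group label."""
    counts = Counter(classify(t) for t in triplets(seq, frame))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}

def frame_shift_delta(seq, frame_a, frame_b, classify):
    """Change in group frequency when moving from frame_a to frame_b."""
    fa = group_frequencies(seq, frame_a, classify)
    fb = group_frequencies(seq, frame_b, classify)
    return {g: fb.get(g, 0.0) - fa.get(g, 0.0) for g in set(fa) | set(fb)}
```

A systematic variation would appear as deltas of consistent sign and size across many messages; what counts as "systematic" is left to the user of the sketch.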

Example 1: Generation of a predictive amino acid alphabet based on binary choices
which are a function of enthalpy of codon-anticodon interaction

This example provides a predictive four letter amino acid alphabet {4a^) for the 
representation of protein primary structures (s* s ) from the energetic properties of mRNA 
molecules, i.e., the translation of mRNAs. While not wishing to be bound by theory, the
basis for deriving an amino acid alphabet from codon-anticodon interactions can be
rationalized as follows: if the genetic code was not "frozen" prior to the onset of
translation and the evolution of protein primary structure, then the evolutionary
trajectory of this code may have been one factor which determined important properties
of protein primary structure. Energetics of codon-anticodon interactions may have been
relevant to the evolution of the genetic code before ribosomes existed, when these
interactions occurred in an aqueous medium.

The configuration of the reading frame may also provide a basis for deriving an
amino acid alphabet. Figure 1 schematically depicts two alternative reading frames for a
nucleic acid sequence, each reading frame defining an energetic packet or triplet; each
nucleic acid base of the message is represented by a black square. Again, while not
being bound by theory, evolution may have favored systems which would allow slippage
from frame 1 to frame 2. This would impose entropic requirements on the code. It is
noted that in this system, which permits "slippage", energy packaging may be analogous
to human linguistic systems which permit slippage and routines for assigning syllabic
stress (f0) and consequent systematic recasting of signifying sound tokens. For
comparison, refer to Grimm's Law and Verner's Law for Indo-European.

It is shown herein that the energies of codon-anticodon interactions pattern
systematically, that this pattern implicitly defines a particular amino acid alphabet, and
that this amino acid alphabet characterizes protein primary structure predictively, i.e.,
provides insight into protein secondary and tertiary structure.

Table IA shows the Ornstein-Fresco ΔH values for the 10 possible base pair
overlaps. Using those ΔH values, average ΔH values were calculated for all 64 possible
codon-anticodon interactions for all possible mRNA pentamers (or "five-envelopes")
with codons as the center three bases. The average ΔH values shown in Tables IB and
IC assume that there is no wobble pairing of codon and anticodon. The average ΔH
values in Tables IB and IC were calculated according to the following formula: for any
pentamer ABCDE, ΔH is calculated for BCD according to the formula:
(ΔH(AB) + ΔH(BC) + ΔH(CD) + ΔH(DE)) / 4. The 64 codon triplets are shown in the first
column of Table IB. Values for a codon in each of all possible five-envelopes are shown
in each row. For example, in the case of UUU, the enthalpy value for a UUU codon
preceded by a U and followed by a U is 2MQ. The enthalpic value when UUU is preceded
by a U and is followed by a C is 2.45. The average value for a codon in all possible
"five-envelopes" is given in the penultimate column on the right side of the table. For the
UUU codon, the average for all possible five-envelopes is 2.43. That average is calculated
for all codons in Table IB. The final column (far right) of Table IB provides the average
enthalpic value for all codons having a common leading doublet. For example, all
codons which begin with the doublet UU have an average enthalpic value of 2.11.
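
The pentamer averaging formula above lends itself to a short sketch. The function names and the `doublet_dh` lookup (a mapping from each of the 16 ordered doublets to a ΔH value, to be filled in by the user from Table IA) are assumptions for illustration; no enthalpy values are supplied here.

```python
from itertools import product

BASES = "UCAG"

def pentamer_dh(pentamer, doublet_dh):
    """Mean of the four overlapping doublet values of pentamer ABCDE:
    (dH(AB) + dH(BC) + dH(CD) + dH(DE)) / 4."""
    if len(pentamer) != 5:
        raise ValueError("expected a pentamer of five bases")
    return sum(doublet_dh[pentamer[i:i + 2]] for i in range(4)) / 4

def codon_mean_dh(codon, doublet_dh):
    """Average of pentamer_dh over all 16 five-envelopes N + codon + N."""
    values = [pentamer_dh(a + codon + e, doublet_dh)
              for a, e in product(BASES, repeat=2)]
    return sum(values) / len(values)
```

Averaging `codon_mean_dh` over the four codons sharing a leading doublet would give the quantity described for the final column of Table IB.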

Table IC shows the values from the penultimate column of Table IB. Note that
the values in Table IC hover around four values: 0.6, 1.2, 1.8, and 2.4. It can also be
seen, as indicated in the caption of Table IC, that for any given doublet XX, the average
enthalpic value for the codons XXU and XXA is about 0.6 higher than the average value
for the codons XXC and XXG.

The energetic pattern evident in Table IC manifests itself in mRNAs. Table IIA
shows 16 enthalpically defined codon groups (separated by dashed lines) produced by
ranking the codons according to the interaction ΔH of the leading doublet, that is, the
first two base pairs of the codon, and by the codon interaction enthalpy value from Table
IC. In Table IIA the first column shows all codons. The second column identifies the
first doublet in the third bases of the codon. The third column provides the ΔH of the
first doublet, the fourth column provides the mean codon ΔH over all 16 possible
pentameric envelopes (as set out in Table IB, penultimate column), and the fifth column
provides a letter for a group designation. The horizontal divisions segregate the first
doublets according to the eight energy levels shown in Table IA. Each of the groups
thus formed by horizontal division is further subdivided on the basis of the average value
for the codon over the possible five-envelopes of Table IB and by which of the 4
energy levels identified in Table IC it falls into. Table IIB is analogous except that the
first binary choice applied is the ΔH for the second or final doublet of the codon.

Table IIC shows the frequency of Table IIA, or leading, codon groups and of
Table IIB, or following, codon groups in a test mRNA database.

The leading or L codon groups of Table IIC correspond to frame 1 of the mRNA
and the final or F codon groups in Table IIC correspond to frame 3 of the mRNA. The
middle column of Table IIC shows the difference in frequency between the L groups and
the F groups shown in the first and last columns of Table IIC. It can be seen that the
differences are very small, which may be a consequence of an original evolutionary
pentameric energy packaging scheme.

One possible explanation for this conserved "epiphenomenon" is that the present
day "ratchet 3/read 3" translation system evolved from a "ratchet 3/read 5 on center"
primordial translation system. Present day "frame shift suppressor" tRNAs, with
anticodon loops greater than 3, are possibly mutant analogs of ancestor tRNAs which
regularly read pentamers. According to this view, as ratchet 3/read 3 translation systems
evolved from ratchet 3/read 5 ancestral translation systems, mRNAs would have had to
be repackaged in one of two alternative reading frames different from the original
reading frame. For example, an original evolutionary ratchet 3/read 5 on center system
would read pentamer 12345 as 234. This corresponds to present day frame 2. However,
a ratchet 3/read 3 translation system reading from that same pentamer 12345 would read
123, corresponding to present day reading frame 1, or else it would read 345,
corresponding to present day reading frame 3. It is believed that the prevalence of the
"weak" bases U and A at the 5' ends of the anticodon loops of tRNA pentamers would
favor repackaging of codons into present day frame 1 rather than into present day frame
3.
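
The three readings of a pentamer discussed above can be made concrete with a small sketch. The function name `pentamer_readings` and the use of integer frame numbers as dictionary keys are assumptions for illustration.

```python
def pentamer_readings(pentamer):
    """Return the three triplets readable from pentamer 12345:
    frame 1 -> 123, frame 2 -> 234 (the "read 5 on center" triplet),
    frame 3 -> 345."""
    if len(pentamer) != 5:
        raise ValueError("expected a pentamer of five bases")
    return {1: pentamer[0:3], 2: pentamer[1:4], 3: pentamer[2:5]}
```

For the pentamer GUUAC, for instance, the center triplet UUA is the frame 2 reading, while GUU and UAC are the frame 1 and frame 3 readings.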

If such an evolution from a ratchet 3/read 5 on center to a ratchet 3/read 3
translation system occurred, the resulting frameshift from reading frame 2 to reading
frame 1 would have the potential to cause disastrous changes in protein structure as the
alternate reading frame was read. There are at least two ways in which catastrophic
mutations could be avoided. First, if the pentamer packets of the earliest mRNAs were
read "loosely" by the earliest tRNA anticodon, that is, if early tRNAs could read either
123, 234, or 345 out of each pentamer, then the loose reading would result in
evolutionary pressure to select mRNAs containing packets which would not introduce
harmful amino acids into protein primary structures when the packets were read
differently, e.g., when the packets were read in frame 1 rather than in frame 2. Second,
if the mRNAs were so selected from the start, then a systemic frameshift would not
necessarily introduce harmful amino acids into protein primary structures in numbers
sufficient to damage structure and/or function of the protein, and in fact might permit the
introduction of novel amino acid sequences with beneficial effects on protein secondary
and tertiary structures.

This suggests that if a systemic frameshift occurred, some codon distributions
would have remained essentially unchanged ("constant" codons) while other codon
distributions would have changed ("wild card" codons), which could have a beneficial
effect on protein structure. In this case, the evolutionary distinction between "wild card"
and "constant" codons might classify amino acids in such a way as to enable the
construction of a predictive amino acid alphabet. Accordingly, a binary choice alphabet
was created in which the "constant" vs. "wild card" distinction was one binary choice
parameter (Figure 2).

Table IIIA shows possible enthalpic groups of leading and final triplets in mRNA
pentamers with the 64 codons as centers. An example is shown in Figure 2, in which the
codon UUA is the center triplet. The first column of Figure 2 shows the four possible
leading (L) triplets, together with the classification group from Table IIA in the second
column. The fourth column of Figure 2 shows the classification group of the final (F)
triplets shown in the last column of Figure 2.

As shown in Table IIIA, doublets can be classified as "constant codon doublets"
or "wild card codon doublets". A constant codon doublet is a doublet XX of a codon
XXY or XXR (Y and R stand for a pyrimidine base or a purine base, respectively), in
which XX is UU, CC, GG, or AA, for which codon, as shown in Table IIIA, the leading
(NXX) and final (XYN or XRN) triplets of all possible pentamers (N is any base)
belong to the same enthalpic groups of Tables IIA and IIB. For example, for the codon
UUA (boxed line at upper left of Table IIIA), the four possible leading triplets (NUU) all
belong to the groups Z, W, and X. The four possible final triplets (UAN) also all belong to
the groups Z, W, and X. Because U is a pyrimidine (Y) and A is a purine (R), UUA is a
constant codon doublet of class YXR. A "wild card codon doublet", in contrast, shows
an alternation between enthalpic groups of Tables IIA and IIB as the leading and final
triplets are analyzed over all pentamers. For example, for the codon UUU (top line at
upper left of Table IIIA), the four possible leading triplets (NUU) belong to the groups
Z, W, and X, as noted above. The four possible final triplets (UUN) belong to the groups
Z, V, Y, and U, differing from the leading triplets. Because U is a pyrimidine (Y), UUU
is a wild card codon doublet of class YXY.

The distinction between constant codon doublets and wild card codon doublets
can be used to construct a four letter amino acid alphabet. As shown in Figure 3, the 64
codons can be divided into four groups: constant YXR doublets, constant RXY
doublets, wild card YXY doublets, and wild card RXR doublets.
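
The four-group division can be expressed compactly if, as stated elsewhere in the specification, "constant" corresponds to the non-symmetrical patterns (YXR, RXY) and "wild card" to the symmetrical patterns (YXY, RXR). The following is a minimal sketch under that assumption; the names `base_type` and `codon_class` are hypothetical.

```python
PURINES = set("AG")
PYRIMIDINES = set("UCT")  # T is accepted so DNA-style sequences also work

def base_type(base):
    """Return 'R' for a purine and 'Y' for a pyrimidine."""
    if base in PURINES:
        return "R"
    if base in PYRIMIDINES:
        return "Y"
    raise ValueError(f"unexpected base: {base!r}")

def codon_class(codon):
    """Classify a codon by the purine/pyrimidine type of its first and third
    bases: non-symmetrical patterns are 'constant', symmetrical ones 'wild card'."""
    pattern = base_type(codon[0]) + "X" + base_type(codon[2])
    kind = "wild card" if pattern[0] == pattern[2] else "constant"
    return kind, pattern
```

Under this rule each of the four patterns covers 16 of the 64 codons, since the first and third bases each contribute two choices and the middle base four.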

As shown in Figure 4, a test mRNA database was analyzed to determine the
frequencies of the four codon groups in the four letter amino acid alphabet of Figure 3.
The mRNA database was read in both frame 1 and frame 2. As can be seen from Figure
4, shifting from reading in frame 2 to reading in frame 1 results in the interchange of
frequencies of p and s.

Example 2: Determination of Secondary and Tertiary Protein Structural Features
Correlated With Message Segments Evaluated With a Binary Choice Alphabet

A binary choice alphabet of Example 1 (s, p, d, t) was used to evaluate protein
structures as follows:

Test mRNA sequences were analyzed from a database of mRNAs (e.g., from
GenBank). Note that in GenBank, uracil (U) is stored as "T"; this convention will be
used throughout this example. Each sequence was then analyzed in reading frame 2
using the following mapping:
ATT/ATC/GTT/GTC=A
ACT/ACC/GCT/GCC=B
AAT/AAC/GAT/GAC=C
AGT/AGC/GGT/GGC=D
TTA/TTG/CTA/CTG=E
TCA/TCG/CCA/CCG=F
TAA/TAG/CAA/CAG=G
TGA/TGG/CGA/CGG=H
TTT/TTC/CTT/CTC=I
TCT/TCC/CCT/CCC=J
TAT/TAC/CAT/CAC=K
TGT/TGC/CGT/CGC=L
ATA/ATG/GTA/GTG=M
ACA/ACG/GCA/GCG=N
AAA/AAG/GAA/GAG=O
AGA/AGG/GGA/GGG=P

One binary choice parameter was whether the leading base of the triplet was
purine (A or G; groups A-D and M-P) or pyrimidine (T or C; groups E-L). The other
binary choice parameter was the "wild card" vs. "constant" distinction discussed in
Example 1, supra. It should be noted that this parameter also corresponds to a binary
choice between "symmetrical" (YXY and RXR) codons vs. "non-symmetrical" (YXR
and RXY) codons (in which Y and R are pyrimidine and purine as defined above).
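
Because each of the sixteen letters A-P is fixed by the purine/pyrimidine type of the first and third bases together with the identity of the middle base, the mapping can be generated rather than typed out. The sketch below is illustrative only; the names `ROWS`, `build_map`, `CODON_TO_LETTER` and `map_frame` are hypothetical, and frame k is assumed to start at 0-based offset k-1.

```python
from itertools import product

# (allowed first bases, allowed third bases) for letter blocks A-D, E-H, I-L, M-P
ROWS = [("AG", "TC"), ("TC", "AG"), ("TC", "TC"), ("AG", "AG")]
MIDDLES = "TCAG"            # middle-base order within each block of four letters
LETTERS = "ABCDEFGHIJKLMNOP"

def build_map():
    """Build the 64-entry triplet -> letter table of the 16-letter alphabet."""
    table = {}
    letters = iter(LETTERS)
    for firsts, thirds in ROWS:
        for mid in MIDDLES:
            letter = next(letters)
            for first, third in product(firsts, thirds):
                table[first + mid + third] = letter
    return table

CODON_TO_LETTER = build_map()

def map_frame(seq, frame):
    """Map a DNA-style sequence (A, C, G, T only) to letters A-P in the given frame."""
    start = frame - 1
    return "".join(CODON_TO_LETTER[seq[i:i + 3]]
                   for i in range(start, len(seq) - 2, 3))
```

A sequence containing characters other than A, C, G and T raises a `KeyError` in this sketch; handling of ambiguous bases is left open.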

The mapped string from reading frame 2 was then converted to the binary choice
alphabet (s, p, d, t) according to the following scheme:
ABCDEFGHIJKLMNOP=ssssppppddddtttt. The result is a binary choice alphabet of
degree 2, dividing the genetic code into 4 classes (denoted s, p, d, t), as shown in Figure

The mapped string was then evaluated, over a moving window of 16 triplets (16
letters in the spdt alphabet), to determine regions in which the s:p:d:t ratio was about
3:5:3:5 in reading frame 2 (that is, s >= 2, p >= 4, d >= 2, t >= 4). When such a region
was found, the mRNA sequence was translated to an amino acid sequence in frame 1 for
that region of the mRNA (i.e., by reading the message resulting from adding a base at
the beginning and eliminating a base at the end of the message segment). Our protein
database (described infra) was then searched for proteins which included the amino acid
sequence encoded by the resulting Frame 1 amino acid sequence. When a single protein
was found to have two separate and distinct regions with even low homology to the
derived Frame 1 amino acid sequence, the two regions were often found to have similar,
or virtually identical, secondary and tertiary structural features. When two different
proteins were found, which each manifested one or more regions with even low
homology to the derived Frame 1 amino acid sequence, these regions were often found
to have very similar secondary and tertiary structural features.
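
The moving-window evaluation described above might be sketched as follows. The thresholds follow the text (s >= 2, p >= 4, d >= 2, t >= 4 over 16 letters); the function names `matching_windows` and `frame1_segment`, and the convention that the frame 2 string begins at 0-based nucleotide offset 1, are assumptions for this sketch.

```python
from collections import Counter

MIN_COUNTS = {"s": 2, "p": 4, "d": 2, "t": 4}

def matching_windows(spdt, width=16, min_counts=MIN_COUNTS):
    """Return the start indices (in triplets) of every window of `width`
    letters whose s, p, d, t counts all meet the minimum thresholds."""
    hits = []
    for start in range(len(spdt) - width + 1):
        counts = Counter(spdt[start:start + width])
        if all(counts[k] >= v for k, v in min_counts.items()):
            hits.append(start)
    return hits

def frame1_segment(seq, start_triplet, width=16):
    """Nucleotides of the frame 1 reading for a frame 2 window: the frame 2
    window covers seq[1 + 3k : 1 + 3k + 3*width]; adding a base at the
    beginning and dropping one at the end gives seq[3k : 3k + 3*width]."""
    begin = 3 * start_triplet
    return seq[begin:begin + 3 * width]
```

The segment returned by `frame1_segment` would then be translated with a standard codon table and used as the query against the protein database.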

Example 3: Starting from a Known Protein Structure

Binary choice alphabets (s, p, d, t) were used to evaluate protein structures as
follows:

Test mRNA sequences were read from a database of mRNAs (e.g., from
GenBank). Each sequence was then read in reading frame 1 and in reading frame 2
using the mapping described in Example 2 for the 16-letter alphabet A-P.

The mapped string from reading frame 1 was then converted to a binary choice
alphabet (s, p, d, t) according to the following scheme:
ABCDEFGHIJKLMNOP=ppppssssttttdddd

The mapped string from reading frame 2 was then converted to a binary choice
alphabet (s, p, d, t) according to the following scheme:
ABCDEFGHIJKLMNOP=ssssppppddddtttt
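
The two conversion schemes can be applied with simple translation tables. This sketch assumes the frame 1 scheme reads ppppssssttttdddd and the frame 2 scheme reads ssssppppddddtttt; the names `FRAME1`, `FRAME2` and `to_spdt` are hypothetical.

```python
LETTERS = "ABCDEFGHIJKLMNOP"

# Assumed schemes: frame 1 swaps s with p and d with t relative to frame 2.
FRAME1 = str.maketrans(LETTERS, "ppppssssttttdddd")
FRAME2 = str.maketrans(LETTERS, "ssssppppddddtttt")

def to_spdt(letters, frame):
    """Convert a string over the 16-letter alphabet A-P to the spdt alphabet
    using the scheme for reading frame 1 or 2."""
    tables = {1: FRAME1, 2: FRAME2}
    if frame not in tables:
        raise ValueError("only frames 1 and 2 have a conversion scheme here")
    return letters.translate(tables[frame])
```

Under these assumed schemes the same 16-letter string yields s where the other frame's scheme yields p, and d where it yields t.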

The mapped strings were then evaluated, over a moving window of 16 triplets
(16 letters in the spdt alphabet), to determine regions in which the s:p:d:t ratio was about
3:5:3:5 in both frame 1 and frame 2. When such a region was found, the mRNA
sequence was translated to an amino acid sequence in both frame 1 and frame 2 for that
region of the mRNA. Our protein database was then searched for proteins which contain
the amino acid sequence encoded by the translated region of Frame 2. The database of
protein messages contained messages for three hundred proteins, those proteins being
sixty to six thousand amino acids in length. The proteins included proteins with roles in
protein synthesis, nucleic acid synthesis, protein or nucleic acid degradation, various
"house-keeping" enzymes, and some immunoglobulins. When a protein containing the
sequence was found, the structural similarity (e.g., the tertiary structure) of that portion
of the protein was compared to the structure of the protein encoded by the test mRNA
sequence.

It was found that for several test mRNA sequences, many of those portions of the
identified proteins were structurally very similar to the comparable portions of the
protein encoded by the test mRNA sequence. For example, a helix-strand transition in
the protein encoded by the test mRNA sequence was structurally similar to a helix-strand
transition of a protein located in the protein database according to methods of the
invention. Application of the methods of the invention (e.g., the methods of Example 2
and Example 3) to a variety of test sequences identified structural similarity in at least
one protein of our protein database for other structural motifs such as sheets, helix
entry, helix exit, Pro-His-Pro turns, and the like.

Example 4 

The function of introns (e.g., non-coding DNA sequences in genomic DNA) is
generally not well understood. Methods of the invention provide knowledge which is
useful for investigating intron function. The methods of the invention can include
searching nucleic acid databases (e.g., of genomic DNA) for regions of nucleic acid
which do not code for protein in the present-day reading frame (i.e., frame 1), but which
could code for protein in an alternate reading frame (e.g., frame 2 or frame 3). Such a
presently non-coding region (i.e., an intron) could correspond to a region of a nucleic
acid which was a coding region prior to a frameshift. Such formerly-coding regions
could encode alternate structures (i.e., protein regions which differ from the modern
protein regions) which preserve the function of the protein.

Thus, a nucleic acid which represents both coding and non-coding regions can be
analyzed in both frames 1 and 2, as described supra for Examples 2 and 3. Where a
non-coding region, such as an intron, is found in which the s:p:d:t ratio is about 3:5:3:5 in
frame 2, that region may correspond to a region of the nucleic acid which coded for
protein structure prior to a shift in reading frame.

Equivalents

Those skilled in the art will recognize, or be able to ascertain using no more than
routine experimentation, many equivalents to the specific embodiments of the invention
described herein. Such equivalents are intended to be encompassed by the following
claims. The contents of all references cited herein are hereby incorporated by reference.

What is claimed is:

1. A method of evaluating protein structure comprising:
providing a nucleic acid sequence which encodes the protein structure;
assorting bases of the nucleic acid sequence into subject triplets; and
assigning one or a plurality of subject triplets to one of a plurality of classes,
wherein the assignment is a function of classifying triplets of the nucleic acid sequence
as members of a class of a binary choice alphabet of n degrees of freedom, and wherein
the classes can be generated by applying n binary choice parameters to a triplet to yield
at least 2^n classes of subject triplets, wherein a binary choice parameter is a function of a
message-level property of the nucleic acid sequence,
thereby evaluating the protein structure.

2. The method of claim 1, further comprising making a record, on a machine
readable medium, of the class assigned to one or more triplets.

3. The method of claim 1, wherein triplets are assigned to a first and a second
class:
the first class having the property that a message made of triplets drawn
exclusively from the first class is less likely to form secondary (intrachain) structure than
is a message which is made of triplets from both the first class and the second class of
triplets, and
the second class having the property that a message made of triplets drawn
exclusively from the second class is less likely to form secondary (intrachain) structure
than is a message which is made of triplets from both the first class and the second class
of triplets.

4. The method of claim 1, wherein the message-level property is: a function of
the UA content of a subject triplet; a function of the GC content of a subject triplet; a
function of the size or molecular weight of a triplet; a function of whether the triplet is
keto rich or amino rich; a function of whether the triplet is purine rich or pyrimidine
rich; or a function of the enthalpy of the interaction between the triplet and a fully or
partially complementary nucleic acid.



5. The method of claim 1, wherein a subject triplet 456 of a nucleic acid
sequence of bases 123 456 789 is assigned into a class as a function of
(1) performing one or more of (i)-(iii):
(i) applying a binary choice parameter to a leading triplet of 456, e.g., to one or
more of triplet 123, 234, or 345, to yield a leading value;
(ii) applying a binary choice parameter to 456, to provide a center value;
(iii) applying a binary choice parameter to a following triplet of 456, e.g., to one
or more of triplet 567, 678, or 789, to yield a following value;
(2) assigning one or a plurality of subject triplets 345 into a class based on the
values determined in one or more of (i)-(iii),
thereby assigning one or a plurality of subject triplets into classes.

6. A class-constant table of nearest neighbor relationships for amino acid
residues which provides, for each of a plurality of class constant nearest neighbors, a
frequency of occurrence which is a function of the occurrence of the class constant
nearest neighbor pair in a collection of at least 10 proteins.

7. A method of evaluating a protein structure comprising:
providing a class-constant table of nearest neighbor relationships for amino acid
residues;
providing a nucleic acid which encodes a protein structure; and
comparing one or a plurality of the observed nearest neighbor pairs in the protein
structure with the frequencies provided by the class constant table, thereby evaluating
the protein structure.

8. The method of claim 7, wherein the comparison can include: assigning an
expected frequency from the class constant table to one or a plurality of the observed
nearest neighbor pairs and determining how many of the observed nearest neighbor pairs
fall above or below a predetermined value; determining the likelihood of occurrence, as
predicted by the class constant table, for an observed nearest neighbor pair; or
determining if an observed nearest neighbor pair of a first and a second amino acid
residue from the protein structure is predicted by the class constant table to occur at a
predetermined frequency.

9. The method of claim 7, further comprising making a record of observed class
constant nearest neighbors in the protein structure on a machine-readable medium.

10. A machine-readable medium on which is recorded a class-constant nearest
neighbor table.



11. A method of evaluating a protein structure for resistance to change, e.g.,
evolutionary or mutational change, comprising:
identifying regions of a protein which are encoded by runs of a single subcode,
thereby identifying regions which have been resistant to change and which are therefore
predicted to be functionally or structurally significant.

12. The method of claim 11, wherein the method includes determining if the
nucleic acid sequence which encodes the protein structure includes a run of triplets at
least 40 triplets in length, in which at least 90% of the triplets in the run are from one
class.

13. A method of evaluating a protein structure for the presence of critical amino
acid residues comprising:
identifying critical amino acid residues by identifying minority codons in runs
encoded by codons of a single class or subcode, thereby identifying residues which have
been resistant to change and which are therefore believed to be functionally important.

14. The method of claim 13, wherein the evaluation comprises identifying a
triplet from a first class in a run of triplets of a second class at least 40 codons in length,
in which at least 40% of the codons are from the second class, thereby identifying the
triplet of the first class as encoding a critical residue.

15. A method for evaluating a protein structure comprising:
providing a nucleic acid sequence which encodes the protein structure;
assorting bases of the nucleic acid sequence into subject triplets; and
assigning at least one of the subject triplets to one of a plurality of classes,
wherein the assignment is a function of classifying the subject triplets of the nucleic acid
sequence under a binary choice alphabet of n degrees of freedom by applying n binary
choice parameters to a triplet to yield at least 2^n classes of subject triplets, wherein the
assignment provides at least four classes of triplets, the at least four classes of triplets
being represented in at least a portion of the nucleic acid sequence in a ratio of about
3:5:3:5;
thereby evaluating the protein structure.

16. The method of claim 15, wherein the method includes making a record on a
machine-readable medium of the class assigned to one or more triplets.

17. A method for identifying coding regions of a nucleic acid sequence, the
method comprising:
providing the nucleic acid sequence;
assorting bases of at least a portion of the nucleic acid sequence into a plurality
of subject triplets;
assigning the plurality of subject triplets to one of a plurality of classes, wherein
the assignment is a function of classifying the subject triplets of the nucleic acid
sequence under a binary choice alphabet of n degrees of freedom by applying n binary
choice parameters to a triplet to yield at least 2^n classes of subject triplets, wherein the
assignment provides at least four classes of triplets A, B, C, and D;
determining whether the plurality of subject triplets are distributed into the at
least four classes of triplets A:B:C:D in a ratio of about 3:5:3:5;
thereby identifying coding regions of the nucleic acid sequence.

19. The method of claim 17, wherein the method includes making a record on a
machine-readable medium of the class assigned to one or more triplets.

20. A method for identifying a protein that includes a polypeptide portion which
is structurally or functionally similar to all or a portion of a test protein, the method
comprising:
providing a nucleic acid sequence which encodes all or a portion of the test
protein;
assorting bases of at least a portion of the nucleic acid sequence into a plurality
of subject triplets in a first reading frame;
assigning the plurality of subject triplets in the first reading frame to one of a
plurality of classes, wherein the assignment is a function of classifying the subject
triplets of the nucleic acid sequence under a first binary choice alphabet of n degrees of
freedom by applying n first binary choice parameters to a triplet to yield at least 2^n
classes of subject triplets, wherein the assignment provides at least four classes of
triplets distributed in a ratio of about 3:5:3:5;
assorting bases of the at least a portion of the nucleic acid sequence into a
plurality of subject triplets in a second reading frame;
assigning the plurality of subject triplets in the second reading frame to one of a
plurality of classes, wherein the assignment is a function of classifying the subject
triplets of the nucleic acid sequence under a second binary choice alphabet of n degrees
of freedom by applying n second binary choice parameters to a triplet to yield at least 2^n
classes of subject triplets, wherein the assignment provides at least four classes of
triplets distributed in a ratio of about 3:5:3:5; and
identifying a protein which includes a polypeptide portion encoded by the
plurality of triplets in the second reading frame;
thereby identifying a protein that includes a polypeptide portion which is
structurally or functionally similar to all or a portion of the test protein.

21. A method for identifying a mutation-prone region of a viral nucleic acid
sequence comprising:
providing the nucleic acid sequence;
assorting bases of at least a portion of the nucleic acid sequence into a plurality
of subject triplets in a first reading frame;
assigning the plurality of subject triplets in the first reading frame to one of a
plurality of classes, wherein the assignment is a function of classifying the subject
triplets of the nucleic acid sequence under a binary choice alphabet of n degrees of
freedom by applying n binary choice parameters to a triplet to yield at least 2^n classes of
subject triplets, wherein the assignment provides at least four classes of triplets
distributed in a ratio of about 3:5:3:5;
assorting bases of the at least a portion of the nucleic acid sequence into a
plurality of subject triplets in a second reading frame; and
assigning the plurality of subject triplets in the second reading frame to one of a
plurality of classes, wherein the assignment is a function of classifying the subject
triplets of the nucleic acid sequence under a binary choice alphabet of n degrees of
freedom by applying n binary choice parameters to a triplet to yield at least 2^n classes of
subject triplets, wherein the assignment provides at least four classes of triplets
distributed in a ratio of about 3:5:3:5;
thereby identifying a mutation-prone region of the nucleic acid sequence.

Table IA: RNA Doublet Relative ΔH Enthalpies
(from Ornstein, Fresco)

aa/ua 2J6
ga/cu 1.41
ug/ac 1.16
cg/gc 0.00

Table IIC

Comparative Frequencies of "L" Codon Groups

Box I Observations where certain claims were found unsearchable (Continuation of item 1 of first sheet)

This International Search Report has not been established in respect of certain claims under Article 17(2)(a) for the following reasons:

1. [X] Claims Nos.:
because they relate to subject matter not required to be searched by this Authority, namely:
see FURTHER INFORMATION sheet PCT/ISA/210

[ ] Claims Nos.:
because they relate to parts of the International Application that do not comply with the prescribed requirements to such
an extent that no meaningful International Search can be carried out, specifically:

[ ] Claims Nos.:
because they are dependent claims and are not drafted in accordance with the second and third sentences of Rule 6.4(a).

This International Searching Authority found multiple inventions in this international application, as follows:

3. [ ] As only some of the required additional search fees were timely paid by the applicant, this International Search Report
covers only those claims for which fees were paid, specifically claims Nos.:

4. [ ] No required additional search fees were timely paid by the applicant. Consequently, this International Search Report is
restricted to the invention first mentioned in the claims; it is covered by claims Nos.:

Remark on Protest [ ] The additional search fees were accompanied by the applicant's protest.
No protest accompanied the payment of additional search fees.

Remark: Although claims 6 and 10 are directed to a representation of
information on a carrier, the search has been carried out and based on
the molecular structure represented by this information (Art. 17, Rule
39.1 PCT).



